Shannon[T] and Yavuz[5] estimated the entropy of real documents. This note 
derives an upper bound on the entropy from the power law. 

Let $D$ be a set of documents and $fr(w)$ be the number of occurrences of a 
word $w$ in $D$. Given an integer $k$, we denote by $F(k)$ the number 
of words whose frequency is $k$, i.e., $F(k) = |\{w \in D \mid fr(w) = k\}|$. The 
entropy of the document set $D$ is defined by $E(D) = -\sum_{w \in D} p(w) \log(p(w))$, 
where $p(w) = fr(w)/N$ and $N = \sum_{w \in D} fr(w)$. 
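
The definitions above can be made concrete with a small sketch. It assumes whitespace tokenization and lowercasing, neither of which the note specifies, and it uses the natural logarithm; the function names are illustrative only.

```python
import math
from collections import Counter


def word_frequencies(documents):
    """Return fr as a mapping w -> number of occurrences of w in the document set.

    Assumption: words are obtained by lowercasing and splitting on whitespace.
    """
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def frequency_spectrum(fr):
    """Return F as a mapping k -> number of distinct words w with fr(w) = k."""
    return Counter(fr.values())


def entropy(fr):
    """Return E(D) = -sum_w p(w) log p(w), where p(w) = fr(w)/N (natural log)."""
    n = sum(fr.values())
    return -sum((c / n) * math.log(c / n) for c in fr.values())


if __name__ == "__main__":
    docs = ["a b a c", "b a d"]
    fr = word_frequencies(docs)
    print(dict(fr), dict(frequency_spectrum(fr)), entropy(fr))
```

Grouping words by their frequency gives the identity $N = \sum_k F(k) \cdot k$, which can be checked directly on the output of `frequency_spectrum`.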

Theorem If $F(k)$ follows the power law $\log(F(k)) = -a \log(k) + b$ with $a > 1$, 
then $E(D) < [\ldots] + 1)$. 
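
To check whether an observed frequency spectrum is close to the power-law hypothesis, one option is to estimate $a$ and $b$ by ordinary least squares on the points $(\log k, \log F(k))$. This fitting procedure is an assumption made here for illustration and is not taken from the note; the function name is hypothetical.

```python
import math


def fit_power_law(F):
    """Least-squares fit of log F(k) = -a*log(k) + b; returns (a, b).

    F maps each frequency k >= 1 to a positive count F(k) and must contain
    at least two distinct values of k. Natural logarithms are used.
    """
    xs = [math.log(k) for k in F]
    ys = [math.log(F[k]) for k in F]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    # The model has slope -a and intercept b.
    return -slope, mean_y - slope * mean_x
```

The theorem applies only when the estimated exponent satisfies $a > 1$.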


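Proof Since $a > 1$, we have $-\log(k) + [\ldots] < \log(F(k)) < -\log(k) + b$. Thus we 
have $[\ldots] < F(k) < e^{b}/k$ and $[\ldots] < F(k) \cdot k < e^{b}$. Let $M$ be the maximal word 
frequency. Then the sum of word frequencies $N$ can be evaluated as follows: 
$N = \sum_{w} fr(w) = \sum_{k=1}^{M} \sum_{fr(w)=k} fr(w) = \sum_{k=1}^{M} F(k) \cdot k > [\ldots] = M e^{[\ldots]}$. 
Therefore, we have $N > M e^{[\ldots]}$ and $[\ldots] < e^{[\ldots]}$. Now, we can evaluate 
the entropy as follows: 
$E(D) = -\sum_{w} p(w) \log(p(w)) = -\sum_{k=1}^{M} F(k) \frac{k}{N} \log\left(\frac{k}{N}\right) = [\ldots] < -\int [\ldots] \log(x)\,dx < -e^{[\ldots]} \int_{0}^{[\ldots]} \log(x)\,dx < -e^{[\ldots]} \left[ x(\log(x) - 1) + C \right]_{0}^{[\ldots]}$ 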